Fighting Overfit with PreciseDist

“Everybody has their taste in noises as well as in other matters; and sounds are quite innoxious, or most distressing, by their sort rather than their quantity.”

-Jane Austen, Persuasion, p 160

Data set-up

In our minds, the first step of cluster validation is a validation of the underlying structure inherent in the data. Of course, a clustering algorithm is guaranteed to find clusters, so we need to know if there is likely to be anything worth finding in the first place. And while visualizing structure in a dataset is not guaranteed to represent that structure faithfully, it is often a good, logical first step. So, this is where we will begin.

Note: While this data-set up is very similar to the setup used in the Cell Cycle Vignette - Experiment 5: Minkowski 100x and other vignettes, it is not exactly the same. Here, we are inputing 10 minkowski distance functions into precise_dist(), but we are setting partitions = NULL.

library(PreciseDist)
data("data_cell_cycle")
str(data_cell_cycle[1:5])
library(dplyr)
cell_cycle_data <- data_cell_cycle %>%
  dplyr::select(-Cell_cycle) %>%
  as.matrix()
cell_cycle_labels <- data_cell_cycle %>%
  dplyr::select(Cell_cycle) %>%
  as.matrix()

cell_cycle_minkowski_params <- seq(0.45, 0.54, length.out = 10)
cell_cycle_minkowski_funcs <- precise_func_fact(
  func = "minkowski",
  params = cell_cycle_minkowski_params
)
library(future)
library(doFuture)
registerDoFuture()
plan(multiprocess, workers = 10)
cell_cycle_minkowski_dists <- cell_cycle_data %>%
  as.matrix() %>%
  precise_dist(
    dist_funcs = cell_cycle_minkowski_funcs,
    time_series = FALSE,
    partitions = 1,
    suffix = "cell_minkowski_",
    file = "/absolute_path/to_somewhere/with_full_name/inclusing_the/file_extension.rds",
    parallel = TRUE,
    local_timeout = Inf,
    verbose = TRUE
  )
cell_cycle_minkowski_transformed <- cell_cycle_minkowski_dists  %>%
  precise_transform(transform = "laplacian")
cell_cycle_minkowski_fused <- precise_fusion(
  cell_cycle_minkowski_transformed,
  fusion = "fuse",
  verbose = TRUE
)

Now, we will view the results as we typically do:

library(future)
library(doFuture)
registerDoFuture()
plan(multiprocess, workers = 8)
graph_mm <- precise_graph(
  data = cell_cycle_minkowski_fused,
  method = 1,
  n_neighbors = 50,
  spread = 1,
  min_dist = 0.001,
  bandwidth = 10,
  parallel = TRUE,
  verbose = FALSE
)
viz_graph_mm <- precise_viz(
  data = graph_mm,
  plot_type = "fr_3d_graph",
  k = 5,
  color_vec = NULL,
  colors = NULL,
  size = 0.5,
  graphml = NULL,
  html = NULL,
  verbose = FALSE
)
viz_graph_mm$visual_output

OK, good, we have a graph with three distinct hubs, and because we are working with cell cycle data, so far it seems very likely we are on the right path. But, how do we know for sure that we are not just overfitting noise? Although the structure seems too distinct for that to be the case, we would still like to know how strong the visualized relationships are. One way to test that is to preturb the input to see if our beautiful structure holds.

Visual validation with noise using the precise_dist() partitions parameter

The best way to add noise is to do it from step 1. That is, we could add noise after the distance calculation, but that will lead to an information leak we can never plug up. Specifically, because we use the full amount of data to calculate the distances, even if we replace 20% of the data with noise, the rest of the data would still have been calculated before the 20% of noise was introduced, meaning that the 20% of original data we discard will still be present in the relationships the other 80% of the data defined. Thus, to add noise from the beggining, all we need to do is set the precise_dist() partitions parameter. In this first example we will set partitions = 10, meaning that 10% of the resulting distance will be noise. Also, note that we are inputing only a single distance function here instead of 10. This way, the output will be 10 distances, which matches the number of distances we calculated above:

cell_cycle_minkowski_params <- seq(0.5, 0.5, length.out = 1)
cell_cycle_minkowski_funcs <- precise_func_fact(
  func = "minkowski",
  params = cell_cycle_minkowski_params
)
library(future)
library(doFuture)
registerDoFuture()
plan(multiprocess, workers = 10)
cell_cycle_minkowski_dists <- cell_cycle_data %>%
  as.matrix() %>%
  precise_dist(
    dist_funcs = cell_cycle_minkowski_funcs,
    time_series = FALSE,
    partitions = 10,
    suffix = "cell_minkowski_",
    file = "/absolute_path/to_somewhere/with_full_name/inclusing_the/file_extension.rds",
    parallel = TRUE,
    local_timeout = Inf,
    verbose = TRUE
  )
cell_cycle_minkowski_transformed <- cell_cycle_minkowski_dists  %>%
  precise_transform(transform = "laplacian")
cell_cycle_minkowski_fused <- precise_fusion(
  cell_cycle_minkowski_transformed,
  fusion = "fuse",
  verbose = TRUE
)

Now we will visualize the output to see if adding 10% noise changed out results much. The code is the same as above, but being ommitted for asthetic purposes:

cell_cycle_minkowski_dists <- read_rds("/home/brian/Desktop/PreciseDist_Paper/Paper_Data/fighting_overfit_with_precise_dist/minkowski_1_10.rds")
cell_cycle_minkowski_transformed <- cell_cycle_minkowski_dists  %>%
  precise_transform(transform = "laplacian")
cell_cycle_minkowski_fused <- precise_fusion(
  cell_cycle_minkowski_transformed,
  fusion = "fuse",
  verbose = FALSE
)
library(future)
library(doFuture)
registerDoFuture()
plan(multiprocess, workers = 8)
graph_mm2 <- precise_graph(
  data = cell_cycle_minkowski_fused,
  method = 1,
  n_neighbors = 50,
  spread = 1,
  min_dist = 0.001,
  bandwidth = 10,
  parallel = TRUE,
  verbose = FALSE
)
viz_graph_mm2 <- precise_viz(
  data = graph_mm2,
  plot_type = "fr_3d_graph",
  k = 5,
  color_vec = NULL,
  colors = NULL,
  size = 0.5,
  graphml = NULL,
  html = NULL,
  verbose = FALSE
)
viz_graph_mm2$visual_output

As we can see, the structure still holds, although it looks a bit less compact. Now, let’s see what happens if we set partitions = 5, meaning that 20% of the resulting distance will be noise. In addition to setting partitions = 5, we will also input two distance functions into precise_dist(), so that the final output is 10 distances like above. Here is how we create two distances:

cell_cycle_minkowski_params <- seq(0.45, 0.55, length.out = 2)

And, here are the results (code being ommitted):

cell_cycle_minkowski_dists <- read_rds("/home/brian/Desktop/PreciseDist_Paper/Paper_Data/fighting_overfit_with_precise_dist/minkowski_1_5.rds")
cell_cycle_minkowski_transformed <- cell_cycle_minkowski_dists  %>%
  precise_transform(transform = "laplacian")
cell_cycle_minkowski_fused <- precise_fusion(
  cell_cycle_minkowski_transformed,
  fusion = "fuse",
  verbose = FALSE
)
library(future)
library(doFuture)
registerDoFuture()
plan(multiprocess, workers = 8)
graph_mm2 <- precise_graph(
  data = cell_cycle_minkowski_fused,
  method = 1,
  n_neighbors = 50,
  spread = 1,
  min_dist = 0.001,
  bandwidth = 10,
  parallel = TRUE,
  verbose = FALSE
)
viz_graph_mm2 <- precise_viz(
  data = graph_mm2,
  plot_type = "fr_3d_graph",
  k = 5,
  color_vec = NULL,
  colors = NULL,
  size = 0.5,
  graphml = NULL,
  html = NULL,
  verbose = FALSE
)
viz_graph_mm2$visual_output

Finally, let’s set partitions = 5, meaning that 20% of the resulting distance will be noise. We will also input five distance functions into precise_dist(), so that the final output is 10 distances like above. Here is how we create the five distances:

cell_cycle_minkowski_params <- seq(0.45, 0.55, length.out = 5)

And, here are the results (code being ommitted):

cell_cycle_minkowski_dists <- read_rds("/home/brian/Desktop/PreciseDist_Paper/Paper_Data/fighting_overfit_with_precise_dist/minkowski_1_2.rds")
cell_cycle_minkowski_transformed <- cell_cycle_minkowski_dists  %>%
  precise_transform(transform = "laplacian")
cell_cycle_minkowski_fused <- precise_fusion(
  cell_cycle_minkowski_transformed,
  fusion = "fuse",
  verbose = FALSE
)
library(future)
library(doFuture)
registerDoFuture()
plan(multiprocess, workers = 8)
graph_mm2 <- precise_graph(
  data = cell_cycle_minkowski_fused,
  method = 1,
  n_neighbors = 50,
  spread = 1,
  min_dist = 0.001,
  bandwidth = 10,
  parallel = TRUE,
  verbose = FALSE
)
viz_graph_mm2 <- precise_viz(
  data = graph_mm2,
  plot_type = "fr_3d_graph",
  k = 5,
  color_vec = NULL,
  colors = NULL,
  size = 0.5,
  graphml = NULL,
  html = NULL,
  verbose = FALSE
)
viz_graph_mm2$visual_output

Understandably, adding 50% noise has now destroyed our results. This is to be expected. What if we have already calculated our distances though, and we don’t want to recalculate them? Next, we will show you how to add noise to your distances after the calculations instead of during them.

Visual validation with noise using the precise_transform() add_noise parameter

This function does exactly what it sounds like, although instead of inputing the number of partitions we want to initially divide our data into, we input a numeric value between 0-1 representing the percentage of data we want to add noise to. Let’s set up the data like we did at the very beggining, where partitions = 1:

cell_cycle_minkowski_params <- seq(0.45, 0.54, length.out = 10)
cell_cycle_minkowski_funcs <- precise_func_fact(
  func = "minkowski",
  params = cell_cycle_minkowski_params
)
library(future)
library(doFuture)
registerDoFuture()
plan(multiprocess, workers = 10)
cell_cycle_minkowski_dists <- cell_cycle_data %>%
  as.matrix() %>%
  precise_dist(
    dist_funcs = cell_cycle_minkowski_funcs,
    time_series = FALSE,
    partitions = 1,
    suffix = "cell_minkowski_",
    file = "/absolute_path/to_somewhere/with_full_name/inclusing_the/file_extension.rds",
    parallel = TRUE,
    local_timeout = Inf,
    verbose = TRUE
  )

Now if we want to add 25% noise to each of the 10 output distances, we can simply run the following. Note that we run this function in parallel to speed it up:

library(future)
library(doFuture)
registerDoFuture()
plan(multiprocess, workers = 10)
cell_cycle_minkowski_transformed <- cell_cycle_minkowski_dists  %>%
  precise_transform(add_noise = 0.25, parallel = TRUE)

Here are the results:

What if we want to avoid the information leak we described above though? Although the distances have already been calculated, it is still possible to add noise after-the-fact without introducing information leak. The way to do this is to add new distances to the set of distances, which are pure noise. For example, if we wanted to add 50% noise to our original dataset, we can take that dataset and turn it to 100% noise by setting add_noise = 1:

library(future)
library(doFuture)
registerDoFuture()
plan(multiprocess, workers = 10)
cell_cycle_minkowski_dists_noise <- cell_cycle_minkowski_dists  %>%
  precise_transform(add_noise = 1, parallel = TRUE)

Now all we have to do is rbind() (or append() if everything is in list format) the original distances and the noise distances to have a new dataset that contains 50% noise:

cell_cycle_minkowski_dists_noise <- cell_cycle_minkowski_dists  %>%
  rbind(cell_cycle_minkowski_dists_noise()

Here are the results (code is being ommited):

Notice that the results show considerably more structure than when we added 50% noise by setting partitions = 2. This is important to note, and exists for several reasons. First, in all of these experiments we have been comparing red apples to green apples. That is, if we want the output to be 10 distances while changing the partitions parameter, we have to correspondingly input fewer functions into precise_dist(). Thus, in this instance, we fused 20 distances rather than our normal 10. More importantly though, even though we are adding noise, the noise is being integrated in the fusion step rather than in the the distance calculation step. Thus, the patterns in the original distances are complete, and in this case show their evident strength by not being completely destoyed by the addition of 10 extra distances of complete noise. Depending on what you are trying to accomplish, this can be seen as a negative or positive. One important thing to note now, however, is that adding noise after distance calculations is the recommended procedure when dealing with time distances. This is because the additional time component requires a more elaborate re-sampling procedure that also requires the dataset to be ordered by time, which is not neccesairly known.

Brian Muchmore

2018-09-26

Data set-up

Visual validation with noise using the precise_dist() partitions parameter

Visual validation with noise using the precise_transform() add_noise parameter

Contents